08. Exercise: Buffered Streams
External Sort: Part 1
Your employer, Udacisearch, is a new SEO startup that scrapes the internet for various words and search terms. Udacisearch has a batch process that dumps all the words it scraped into a single text file, unsorted.txt
. Your job is to create a new file that contains all the same words as unsorted.txt
, but in the new file they should be sorted alphabetically.
There's just one problem — the text file is so big it won't even fit into your computer's memory! (Note: For the purposes of this exercise, we are only sorting 10,000 words, so they in fact do fit in memory. Just pretend they don't!)
To deal with this limitation, you will use a technique known as External Sorting, which is a way to sort large amounts of data. At a high level, this will involve
- sharding the large file, which means splitting it up into a bunch of smaller files that can individually fit into memory;
- sorting the shard files; and
- merging the sorted shards into a single large sorted file. The merge can be done in a way that does not require loading all the sorted words into memory at once.
In this exercise, you will be implementing Steps 1 and 2 (Step 3 will be done in a later exercise). Don't worry! You don't have to implement your own sorting algorithm to sort the individual files: you can just use List#sort()
or Stream#sorted()
to do the job.
For this exercise, you should focus on the file operations that are required.
You should read the unsorted words from the input
Path
, line by line, using a BufferedReader
. Write the input words to the many shard files. Each shard file should contain at most SHARD_SIZE
words, in sorted order. Remember, just use List#sort()
, Stream#sorted()
, or Collections.sort()
for this part. All the words should be accounted for in the output shard files; you should not skip any words.
Once you are done sorting the words for a shard, write the sorted shard file to the outputFolder
Path
, using the provided getOutputFileName(int)
method to name the individual shard files. You should use a BufferedWriter
to do the writing.
Running the Solution
Finally, compile and run your solution to test it!
javac MakeShards.java
java MakeShards unsorted.txt shards/
You should see 100 shard files appear in the shards
directory. Remember that the words in each shard file will be sorted within the context of that file, but the words are not necessarily sorted across shard files. That doesn't happen until the next exercise, when you will merge the shards together.
Good luck!
TODO List
Task Feedback:
Nice work!
Code
If you need a code on the https://github.com/udacity.
export PATH=/data/jdk-15.0.1/bin:$PATH
export JAVA_HOME=/data/jdk-15.0.1/bin